1.0 Technical Objectives
1.1  Fault  Tolerant  MACH 


The  current  trend  in  computing  is  towards  open  systems  employing  hardware  and  software 
from  multiple  vendors  tied  together  by  portable  software  packages.  The  UNIX  operating 
system ushered in a new era of user freedom from proprietary hardware/software platforms
that  commercial  vendors  used  to  capture  customers.  UNIX  provided  a  portable 
environment  wherein  a  piece  of  software  developed  on  one  system  could  be  moved  to 
another  system  with  minor  effort.  Since  UNIX  was  available  on  a  wide  variety  of 
platforms,  the  user  could  purchase  the  most  cost  effective  hardware  without  incurring  an 
enormous  software  redesign  effort.  MACH  extends  the  UNIX  portability  beyond  the 
hardware  platform  by  providing  a  uniform  treatment  of  both  networked  (often  called 
distributed  computing)  and  parallel  processing  (often  called  shared  memory)  computational 
models.  MACH  sets  a  trend  for  contemporary  operating  systems  by  employing  a 
microkernel  whereby  the  basic  operating  system  functions,  such  as  allocate  memory  or  start 
up  a  task,  are  implemented  in  the  microkernel.  Traditional  operating  system  services,  such 
as  a  file  system,  are  implemented  as  servers  executing  on  top  of  the  microkernel. 

As  computing  systems  assume  more  and  more  critical  tasks  wherein  an  error  can  have 
catastrophic  consequences,  attributes  of  computer  systems  other  than  just  cost  or 
performance  become  more  important.  One  such  attribute  is  the  ability  to  tolerate  a  variety 
of  errors  ranging  from  physical  defects  to  environmentally  induced  changes  to  human 
errors.  There  have  been  fault  tolerant  commercial  computers  for  almost  two  decades.  Most 
fault  tolerant  systems  have  involved  proprietary  hardware  and  software,  locking  users  into 
a  single  vendor.  Furthermore  the  user  had  to  select  either  fault  tolerance  or  performance  - 
an  application  could  not  decide  to  place  some  resources  on  improving  fault  tolerance  and 
the  remaining  resources  on  performance.  No  trade-off  between  fault  tolerance  and 
performance  was  possible.  While  a  network  of  distributed  computers  running  open  system 
software  has  a  natural  degree  of  redundancy  so  that  physical  hardware  failures  could  be 
tolerated,  software  to  take  advantage  of  this  feature  has  been  slow  to  develop.  Research 
has produced software which can tolerate network node failure, assuming fail-fast network
nodes,  implying  that  faults  are  either  detected  or  recovered  from  before  erroneous  output 
can  enter  the  network.  Current  open  systems  such  as  MACH  do  not  implement  the  fail-fast 
model. 
The goal of this research was to design and implement a Fault Tolerant version of the
MACH operating system (FT MACH) that adheres to the fail-fast model and allows the
user  to  select  the  amount  of  fault  tolerance  (including  none)  to  be  allocated  to  each 
application. 

1.2  Ultra-Dependable  Real-Time  Computing 

Over  the  past  20  years,  benchmarks  have  evolved  from  simple,  synthetic  programs  to 
comprehensive  application  suites  for  measuring  the  performance  of  computer  systems,  both 
for  users  of  systems  and  for  designers  of  systems.  The  benchmarks  have  fostered  a  sense 
of  competition  among  manufacturers  to  produce  faster  systems.  Today  there  are  no 
benchmarks  to  measure  the  robustness  and  dependability  of  computer  systems.  Without 
benchmarks  it  is  difficult  to  compare  the  robustness  and  dependability  of  individual 
techniques  or  of  complete  systems.  In  addition,  relative  progress  cannot  be  measured.  The 
objective  of  Robustness  Benchmarks  is  to  define  measures  of  robustness,  develop 
methodologies  for  measuring  robustness,  and  to  implement  portable  software  that  can  be 
used  to  evaluate  fault  tolerant  systems. 

2.0  Technical  Approach 

2.1  Fault  Tolerant  MACH 

The  initial  focus  was  on  adding  error-detection  mechanisms  to  various  features  of  MACH. 
The  first  step  added  observability  and  controllability  to  services  provided  by  the  MACH 
run-time  library.  Library  calls  are  made  to  an  application  built  upon  the  microkernel.  The 
library server has been modified so that all calls are encapsulated into a standard "envelope"
providing a "flight record" of time, calling parameters, and returns. The envelope concept
has  been  formalized  as  the  sentry  model.  In  this  model,  MACH  services  are  viewed  as  a 
combination  of  all  possible  execution  paths  and  data  structures  involved  in  serving  a 
request.  Sentries  are  placed  at  the  entry  and  exit  points  of  services  in  order  to  perform  fault 
management.  Hence,  in  the  sentry  model  a  MACH  call  can  be  guarded  by  more  than  one 
pair  of  entry/exit  sentries.  Sentries  have  been  categorized  to  reflect  their  structure  and 
functionality.  Four  types  of  sentries  have  been  defined:  Fault  Detection  Sentries  (FDS), 
Fault  Recovery  Sentries  (FRS),  Fault  Monitoring  Sentries  (FMS),  and  Validation/Fault 
Injection Sentries (VFS). Fault Monitoring Sentries have been implemented for user level
operating  system  calls  in  MACH  3.0.  These  monitoring  sentries  report  call  entry  and  exit 
time  stamps  as  well  as  input/output  parameters.  The  ability  to  trace  system  behavior, 
particularly in the vicinity of an error, has been very useful in identifying software "bugs"
that  appeared  under  stressing  workloads. 
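The entry/exit pairing described above can be illustrated with a minimal sketch. It is not the MACH implementation: the names `monitoring_sentry`, `flight_record`, and `allocate_memory` are hypothetical, and the record format is assumed, since the report does not specify one.

```python
import functools
import time

# Hypothetical in-memory trace; the report does not specify a record format.
flight_record = []


def monitoring_sentry(service):
    """Guard a service with an entry sentry and an exit sentry."""
    @functools.wraps(service)
    def guarded(*args, **kwargs):
        # Entry sentry: note the call name, its parameters, and the entry time.
        record = {"call": service.__name__, "args": args, "kwargs": kwargs,
                  "t_entry": time.monotonic()}
        try:
            result = service(*args, **kwargs)
            record["result"] = result
            return result
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            # Exit sentry: note the exit time and store the completed record.
            record["t_exit"] = time.monotonic()
            flight_record.append(record)
    return guarded


@monitoring_sentry
def allocate_memory(size):
    """Stand-in for a memory-allocation service (not a real MACH call)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return bytearray(size)
```

Calling `allocate_memory(16)` appends one record holding the call name, the arguments, the returned buffer, and both time stamps. A failing call such as `allocate_memory(0)` is recorded with an `error` field before the exception propagates, which is the kind of trace that helps when examining behavior in the vicinity of an error.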

2.2  Ultra-Dependable  Real-Time  Computing 

A  methodology  has  been  developed  for  the  construction  of  user  mode  modular  robustness 
benchmarks.  The  system  is  stressed  with  incorrect  system  calls  representative  of  the  type 
of  errors  made  by  application  designers  or  corrupted  data.  The  modular  benchmarks  focus 
on  single  errors  to  enhance  repeatability  and  to  isolate  the  corrupting  input.  The 
benchmarks are executed on the actual target system, in contrast to fault injection, which
typically requires modifications to the system; simulation, which is an imperfect model of
the system; and physical methods such as heavy ion bombardment and pin-level injection,
which expose systems to random errors and possible damage.

The  Robustness  Benchmarks  target  specific  functions  of  the  operating  system  (such  as  the 
memory allocator, the file system, the communication subsystem, the run-time library, etc.)
and define a class of feasible faults (such as passing random characters that may have been
generated through communication line noise from remote computing sites) that are deemed
most  likely  to  occur  with  respect  to  that  operating  system  feature.  Each  benchmark 
generates  a  series  of  test  cases  and  keeps  track  of  the  number  of  cases  which  are 
successfully  detected.  The  benchmark  is  robust  enough  to  maintain  accurate  statistical 
count  even  if  one  of  the  tests  crashes  the  operating  system. 
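A minimal sketch of such a benchmark follows, under stated assumptions: `run_benchmark`, `set_buffer_size`, and the JSON tally file are hypothetical, and "detected" is taken to mean that the service rejects the input with an exception. The tally is written to disk before each test starts, so a crash during a test still leaves an accurate count behind.

```python
import json
import os


def _save(tally, path):
    """Write the tally to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(tally, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run_benchmark(service, test_cases, tally_path):
    """Feed each invalid input to `service` and count the ones it rejects."""
    tally = {"run": 0, "detected": 0, "undetected": 0}
    for case in test_cases:
        tally["run"] += 1
        # Persist before executing: after a crash, "run" exceeds
        # detected + undetected by one, which identifies the fatal case.
        _save(tally, tally_path)
        try:
            service(case)
        except (TypeError, ValueError):
            tally["detected"] += 1
        else:
            tally["undetected"] += 1
        _save(tally, tally_path)
    return tally


def set_buffer_size(value):
    """Hypothetical service under test; it lacks an upper-bound check."""
    size = int(value)
    if size < 0:
        raise ValueError("negative size")
    return size
```

With the inputs `["abc", -5, None, 10**12]`, the first three are rejected and the oversized value is silently accepted, so the tally reports three detected cases and one undetected case.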

3.0  Accomplishments 

3.1  Fault  Tolerant  MACH 

The  first  Fault  Recovery  Sentry  for  MACH  3.0  implemented  journalling.  The  Fault 
Monitoring  Sentries  are  used  to  capture  keyboard/mouse  inputs  as  well  as  operating  system 
call  input/output  parameters  from  application  programs  and  to  journal  these  parameters  onto 
permanent  stable  storage.  For  typical  interactive  workstation  user  sessions  (as  opposed  to 
compute-intensive  workstation  usage)  journalling  requires  approximately  10  MBytes  per 
hour  of  storage  with  CPU  overheads  ranging  from  a  few  percent  to  unnoticeable  for 
applications  such  as  word  processing,  drawing  packages,  and  desktop  publishing. 
Recovery  after  a  crash  is  totally  automatic  and  all  data,  except  perhaps  for  the  last 
keystroke,  is  automatically  recovered  through  the  replay  of  the  journal.  Journal  replay  time 
is  a  function  of  the  amount  of  user  interaction  and  the  amount  of  computationally-intensive 
time.  For  interactive  user-oriented  sessions  the  replay  time  is  typically  around  ten  percent 
of the original session. Journalling Fault Recovery Sentries have been demonstrated with a
wide variety of Unix-based applications and do not require detailed knowledge of the
application's internal structure.
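The journal-and-replay idea can be sketched as follows. This is an illustration only, assuming an application whose state is fully determined by its input events; the `Editor` class, the event format, and the one-JSON-record-per-line journal are hypothetical and are not taken from the MACH implementation.

```python
import json
import os


class Editor:
    """Toy application whose state depends only on its input events."""

    def __init__(self):
        self.text = ""

    def apply(self, event):
        if event["kind"] == "key":
            self.text += event["char"]
        elif event["kind"] == "backspace":
            self.text = self.text[:-1]


def journal_event(journal_file, event):
    """Append one event to the journal and force it to stable storage."""
    journal_file.write(json.dumps(event) + "\n")
    journal_file.flush()
    os.fsync(journal_file.fileno())


def replay(journal_path):
    """Rebuild application state by re-applying every journalled event."""
    app = Editor()
    with open(journal_path, encoding="utf-8") as journal_file:
        for line in journal_file:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                break  # torn final record: the last input was not fully written
            app.apply(event)
    return app
```

Replay needs no knowledge of the application's internals beyond feeding it the recorded inputs in order. A record that was only partly written when the crash occurred is discarded, which corresponds to losing at most the last keystroke.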

The second Fault Recovery Sentry for MACH 3.0 implemented check-pointing and
rollback. A novel solution to capturing a checkpoint of multiple concurrent tasks, coupled
with journalling, reduces the stable storage requirement to a total of 10 MBytes and
recovery time to a few minutes, with check-pointing occurring as a background activity.
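The reason check-pointing bounds both storage and recovery time can be shown with a generic sketch. It illustrates checkpoint-plus-journal recovery in general, not the multi-task checkpoint technique reported here; the `CheckpointedCounter` class and its `checkpoint_every` parameter are hypothetical, and state is kept in memory for brevity.

```python
import copy


class CheckpointedCounter:
    """Toy service that journals each update and checkpoints periodically."""

    def __init__(self, checkpoint_every=3):
        self.state = {"total": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.journal = []
        self.checkpoint_every = checkpoint_every

    def apply(self, amount):
        self.journal.append(amount)  # journal the input before acting on it
        self.state["total"] += amount
        if len(self.journal) >= self.checkpoint_every:
            self.take_checkpoint()

    def take_checkpoint(self):
        # The snapshot replaces the journal prefix, so the journal can be
        # emptied and never grows beyond checkpoint_every entries.
        self.checkpoint = copy.deepcopy(self.state)
        self.journal.clear()

    def recover(self):
        """Roll back to the last checkpoint, then replay the journal tail."""
        state = copy.deepcopy(self.checkpoint)
        for amount in self.journal:
            state["total"] += amount
        return state
```

After the updates 1, 2, 3, 4, 5 with `checkpoint_every=3`, the checkpoint holds a total of 6 and the journal holds only the two later updates. Recovery replays those two entries rather than the whole session.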

3.2  Ultra-Dependable  Real-Time  Computing 

Based  upon  our  previous  study  of  Robustness  Benchmarks  the  technology  was  applied, 
under  separate  funding,  to  an  Air  Force  Satellite  computer  ASCM  based  upon  the  1750A 
instruction  set  by  the  IBM  Federal  Systems  Division  at  Manassas,  Virginia.  Using  a 
systematic,  modular  approach  the  parameters  for  operating  system  calls  were  identified  as 
well  as  likely  error  manifestations.  A  watchdog  program  executed  a  series  of  operating 
system calls with a variety of illegal parameters. Almost 20,000 tests were executed
resulting  in  dozens  of  cases  causing  warm  restarts  of  computer  modules  and  one  case  of  a 
cold  restart.  Approximately  one-fourth  of  the  operating  system  calls  were  thus  tested  in 
this  ADA  environment.  The  Fault  Monitoring  Sentries  have  been  used  to  observe  MACH 
3.0  system  behavior  prior  to  a  benchmark  induced  fault.  If  the  fault  results  in  a  system 
crash,  we  attempt  to  generalize  the  system  state  that  induced  the  crash  so  that  a  robustness 
benchmark  can  be  designed  to  probe  that  single  feature. 

The  Robustness  Benchmark  methodology  has  been  applied  to  an  aerospace  fault  tolerant 
computer. This work has been extended and ported to test the Mach operating system.
Data  is  currently  being  collected  and  a  paper  will  be  written  shortly.  A  kernel-level  fault 
injection  and  test  environment  is  being  implemented  utilizing  the  Sentry  mechanism.  This 
environment  will  be  compared  to  the  Mach  Robustness  Benchmark  results. 


4.0  Importance  of  the  Accomplishments 

The  concept  of  Sentries  has  been  defined,  designed,  implemented,  and  demonstrated. 
Sentries  are  implemented  as  middleware  between  unmodified  application  code  and 
unmodified  operating  systems.  Sentries  intercept  operating  system  service  requests  from 
the  application.  Sentries  can  provide  services  both  on  entry  to  the  operating  system  and 
upon  exiting  back  to  the  application.  Services  provided  by  the  sentries  enhance  the 
observability  and  controllability  of  the  system.  Several  classes  of  services  have  been 
identified  including:  journalling  for  roll-back,  assertions  for  error  detection,  replication  for 
fault  tolerance,  fault  injection  for  validation,  etc. 

Sentries  represent  a  framework  for  producing  highly-dependable  systems  from  commercial 
off-the-shelf  hardware  and  software.  Unmodified  legacy  application  software  can  be 
turned  into  fault-tolerant  services.  The  sentry  mechanism  has  been  demonstrated  with 
journalling  applied  to  the  Mach  operating  system  for  workstations  and  the  Windows 
operating  system  for  personal  computers.  Journalling  sentries  allow  complete  recovery  of 
even  multitasking  legacy  application  software  from  errors  induced  by  hardware  (e.g.  power 
outage),  software  (undoing  a  system  call  which  led  to  a  system  crash),  and  operator 
mistakes (e.g. undoing the previous command).

The  Robustness  Benchmark  methodology  has  been  effective  at  discovering  design  flaws  in 
error detection/handling mechanisms in both commercial and dedicated aerospace
fault-tolerant systems.

5.0  Transitions  of  Research 

5.1  To  Navy  and  DOD  Organizations 

5.2  To  Industry 

The  commercial  relevance  of  the  Sentry  technology,  supported  under  these  ONR  grants, 
has  been  recognized  through  the  award  by  ARPA  of  a  SBIR  to  Systems 
Technology/Development  Corporation  (ST/DC)  of  Reston,  Virginia.  Initial  results  during 
Phase  I  of  the  SBIR  have  demonstrated  that  Sentry  technology  provides  highly  effective 
levels  of  fault  tolerance,  with  one  to  two  orders  of  magnitude  reduction  in  costs  compared 
to proprietary hardware/software solutions. Unlike most commercial fault tolerant
systems, which require users to modify and recompile their application code, the
application transparency features of Sentries achieve fault tolerance capabilities without
modifying  application  code.  A  commercial  product  based  on  the  sentry  technology  is 
planned  for  the  first  quarter  of  1996.  Negotiations  are  underway  with  two 
developers/distributors  of  PC  based  software  for  the  Sentry  based  commercial  product. 
Symantec, Dynamics, and Martin Marietta have expressed firm interest in helping to transition
this  technology  into  their  current  products  and  applications. 